Link to the live webpage: https://dungngotan99.github.io/
Since Airbnb was founded in 2008, it has disrupted and changed the traditional hospitality and accommodation industry around the world. The number of rental listings on Airbnb has expanded exponentially over the past few years. Motivated to understand this explosive growth and operation thoroughly, this project will extract data from websites, manipulate it, run exploratory analysis on the processed data, and apply machine learning models. From the owners' standpoint, I will also try to answer as many questions as possible to give hosts the clearest view of Airbnb and how to strategize their business. Specifically, the project will explore how Airbnb listings in different cities affect how owners run their business as hosts and how customers feel about and spend their money on Airbnb houses. Finally, with all the insights we collect, I will suggest ideas to help owners choose the options that best fit their needs.
The project will try to draw a comprehensive picture of Airbnb listings in New York City and then compare it with those of other cities, such as San Francisco and Boston.
In detail, the project will examine how the locations of these accommodations, the times of year customers are likely to travel (the principle of supply and demand), and other factors may affect the price, ratings, and number of listings. Furthermore, customers' reviews play an important role in improving customer service, which may help the company target its advertising and connect customers and hosts with shared interests. Thus, I want to explore how these reviews can be mined and briefly identify some characteristics of customers in order to personalize their experiences.
Further questions:
Python libraries:
1. Data extraction and manipulation: numpy, pandas
2. Data visualization: plotly, seaborn, ipywidgets, and matplotlib
3. Data analysis: scipy and scikit-learn
Others: Natural language processing (NLP) and linear regression.
import pandas as pd
import numpy as np
import scipy.stats as stat
import matplotlib.pyplot as plt
import seaborn as sb
from pygal import *
from ipywidgets import *
import geopandas
from shapely.geometry import Point, Polygon
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from PIL import Image
from pandas.plotting import register_matplotlib_converters
from ipywidgets.embed import embed_minimal_html
register_matplotlib_converters()
sb.set()
# The adjustment below widens the notebook and lets us display data easily.
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
We load the datasets locally; they cover three U.S. cities: Boston, New York, and San Francisco. First, we will clean and explore the New York dataset in depth through graphs, statistics, and a machine learning model. Later, we will compare it with the two other processed datasets. We will also take a look at the dimensions of the NY dataset for later analysis.
# Boston datasets
listing = "../Airbnb-Datasets/Boston/listings.csv"
boston_listings = pd.read_csv(listing)
# San Francisco datasets
listing = "../Airbnb-Datasets/San Francisco/San francisco.csv"
san_listings = pd.read_csv(listing)
# New York datasets
listing_kaggle = "../Airbnb-Datasets/New York/AB_NYC_2019.csv"
listing = "../Airbnb-Datasets/New York/airbnb-listings.csv"
ny_listing_kaggle = pd.read_csv(listing_kaggle)
ny_listings = pd.read_csv(listing, sep=';',low_memory=False)
print("NY dataset's dimension: %dx%d" % (ny_listings.shape))
Looking at the NY dataset's dimensions, we can see that it is huge, and we should be happy about that! However, this comes with a downside: it makes it harder for us to remember and choose the best columns for data analysis. Therefore, we will divide this dataset into 6 sub-categories: text, host, location, house, money, and review. Below is the information about each table.
#ny_listings = ny_listings.set_index("ID")
# Divide the dataset into 6 sub-categories, including text, host, location, house, money, and review
# All shared same ID
text = ['Description','Summary','Experiences Offered', 'Neighborhood Overview', 'Notes']
host = ['Host ID', 'Host Since', 'Host Location', 'Host About','Host Listings Count',
'Host Response Time', 'Host Response Rate','Host Verifications']
location = ['Street', 'Neighbourhood', 'City','State', 'Zipcode', 'Market','Neighbourhood Group Cleansed',
'Neighbourhood Cleansed','Smart Location', 'Country Code','Country', 'Latitude', 'Longitude']
house = ['Property Type', 'Room Type','Accommodates', 'Bathrooms', 'Bedrooms', 'Beds',
'Bed Type','Amenities', 'Square Feet','Calculated host listings count']
money = ['Price', 'Weekly Price', 'Monthly Price', 'Cleaning Fee', 'Availability 30',
'Availability 60','Availability 90', 'Availability 365']
review = ['Number of Reviews', 'First Review', 'Last Review','Review Scores Rating','Reviews per Month',
'Review Scores Accuracy','Review Scores Cleanliness', 'Review Scores Checkin',
'Review Scores Communication', 'Review Scores Location','Review Scores Value']
text_df, host_df, location_df = ny_listings[text].copy(), ny_listings[host].copy(), ny_listings[location].copy()
house_df, money_df, review_df = ny_listings[house].copy(), ny_listings[money].copy(), ny_listings[review].copy()
Fortunately, the datasets we downloaded are tidy and do not have any serious quality issues. However, skimming through them, we can easily spot missing data. For numerical columns, we will replace missing values with the column mean if they account for more than 30% of the elements; otherwise, we will drop those rows. For categorical columns, we will fill missing data with the string "Not provided" if the missing values exceed 50% of the elements; otherwise, we drop the rows as well.
# Create a helper function to replace missing values
def fill_missing(df):
    # Replace missing values in place, column by column
    for each in df.columns.values:
        column = df[each]
        if column.dtypes in ('int64', 'float64'):
            if column.isnull().sum() / len(column) > 0.3:
                df.fillna(value={each: column.mean()}, inplace=True)
            else:
                df.dropna(subset=[each], inplace=True)
        elif column.dtypes == 'object':
            if column.isnull().sum() / len(column) > 0.5:
                df.fillna(value={each: 'Not provided'}, inplace=True)
            else:
                df.dropna(subset=[each], inplace=True)
df_lists = [text_df, host_df, location_df, house_df, money_df, review_df]
for each in df_lists:
    fill_missing(each)
First of all, we are interested in finding correlations among features. For example, we want to see the correlation between the number of reviews for each listing and its price, or between the host response rate and the review score rating, using scatter plots and mean reference lines.
# Display the Money dataframe
money_df.head()
# Display the Review dataframe
review_df.head()
# Create a figure and 2 subplots
fig, axes = plt.subplots(nrows=1,ncols=2,figsize=(20,10), dpi=200)
#Price and review per month
money_review = money_df.merge(review_df, how='inner', left_index=True, right_index=True)
axes[0].scatter('Number of Reviews','Price',data=money_review, s=money_review['Price']/10,c='b',alpha=0.5)
axes[0].set(xlabel='Number of reviews',ylabel='Price', title='Correlation between number of reviews and price')
#Annotate the axis lines with their means
axes[0].axhline(money_review['Price'].mean(),c='black')
axes[0].text(250,200,'Mean: {}'.format(round(money_review['Price'].mean())))
axes[0].axvline(money_review['Number of Reviews'].mean(),c='black')
axes[0].text(25,1000,'Mean: {}'.format(round(money_review['Number of Reviews'].mean())))
#Host response rate and review scores rating
host_review = host_df.merge(review_df, how='inner', left_index=True, right_index=True)
for each in host_review['Host Response Time'].unique():
    data = host_review[host_review['Host Response Time'] == each]
    axes[1].scatter('Host Response Rate','Review Scores Rating',data=data,
                    label=each,s=host_review['Host Response Rate'],alpha=0.5)
axes[1].set(xlabel='Host Response Rate',ylabel='Review Scores Rating',
title='Correlation between response rate and review scores')
#Annotate the axis lines with their means
axes[1].axhline(host_review['Review Scores Rating'].mean(),c='black')
axes[1].text(3,90,'Mean: {}'.format(round(host_review['Review Scores Rating'].mean())))
axes[1].axvline(host_review['Host Response Rate'].mean(),c='black')
axes[1].text(80,30,'Mean: {}'.format(round(host_review['Host Response Rate'].mean())))
# Add the legend to the second subplot
plt.legend()
plt.savefig("num_reviews vs prices");
For the first subplot, the data points cluster around the intersection of the mean number of reviews and the mean price. This may indicate that when the price is low, there are more listings, which seems logical given the law of supply and demand. We can also see many data points lying along each mean axis, which may suggest that the price and the number of reviews are not strongly affected by each other. The plot also gives us the average price and average number of reviews for Airbnb listings in NYC.
For the second subplot, we can easily see two distinct clusters of host response rate: hosts who respond within a few days or more, and the rest. We might expect a host who responds late to receive a low rating, but that does not seem to be true here: hosts who respond within a few days or more have rating scores as high as those who respond within a day or sooner. Therefore, the host response rate may not affect the review score rating. Finally, the plot also gives us the average host response rate and review score rating.
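As a sanity check on these visual impressions, the relationship can also be quantified with Pearson's r from scipy; a minimal sketch on made-up numbers (not the Airbnb data):

```python
import numpy as np
from scipy import stats

# Hypothetical numbers for illustration only -- not the Airbnb data.
reviews = np.array([5, 12, 30, 45, 80, 120])
prices = np.array([220, 180, 150, 140, 110, 90])

# Pearson's r quantifies the linear association seen in a scatter plot.
r, p_value = stats.pearsonr(reviews, prices)
print("Pearson r = %.2f (p = %.3f)" % (r, p_value))
```

A value of r near zero on the real columns would support the reading that the two variables do not drive each other.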
Next step, we are going to focus on how other features affect the price. For instance, the plot below shows the correlation between number of rooms and price.
# Display the House Dataframe
house_df.head()
# Note: the Money DataFrame is displayed in Scatter plot section
# Average number of rooms and price
house_money = house_df.merge(money_df, how='inner', left_index=True, right_index=True)
house_money['Average rooms'] = house_money[['Bathrooms','Bedrooms','Beds']].mean(axis=1)
std_price = (house_money['Price'] - house_money['Price'].mean()) / house_money['Price'].std()
house_money['Std Price'] = std_price
join = sb.jointplot('Average rooms','Std Price', data=house_money,
kind='reg', height=8, ratio=4, color='g',space=0.5)
join = join.annotate(stat.pearsonr)
plt.gca().set(xlabel="Average number of rooms", ylabel='Price',
title='Correlation between number of rooms and price');
According to the plot above, there is a good correlation between the number of rooms and the price, as we expected. It also shows that the price distribution is skewed to the right, indicating that the mean price is higher than the median. The mean may be pulled up by some extremely high outliers, which is fair enough for one of the most expensive cities in the world. Besides, the majority of NYC Airbnb listings have one room, which makes sense for one of the most populous cities in the world.
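The mean-above-median reading can be checked numerically; a small sketch with a synthetic right-skewed (log-normal) sample standing in for the price column:

```python
import numpy as np
from scipy import stats

# Synthetic log-normal sample as a stand-in for the price column.
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=5, sigma=0.5, size=10_000)

# For a right-skewed distribution the mean sits above the median
# and the skewness statistic is positive.
print("mean   = %.1f" % prices.mean())
print("median = %.1f" % np.median(prices))
print("skew   = %.2f" % stats.skew(prices))
```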
The next plot shows the distribution of prices across several areas of NYC.
# Display the Location DataFrame
location_df.head()
# Note: the Money DataFrame is displayed in Scatter plot section
# Price distribution of each location (boxplot)
def function(x):
    location_money = location_df.merge(money_df, how='inner', right_index=True, left_index=True)
    location = location_money['Neighbourhood Cleansed'].value_counts()
    neighbor = location[location.values > x].copy()
    price_dist = {i:[0,[]] for i in neighbor.index.values}
    for each in neighbor.index.values:
        data = location_money[location_money['Neighbourhood Cleansed'] == each]['Price'].values
        price_dist[each][0], price_dist[each][1] = np.mean(data), data
    # Sort by the mean value of each distribution
    price_dist_sorted = sorted(price_dist.items(), key=lambda x: x[1][0])
    sorted_data = {price_dist_sorted[i][0]:price_dist_sorted[i][1][1] for i in range(len(price_dist))}
    plt.figure(figsize=(17,7), dpi=100)
    plt.boxplot(sorted_data.values(),showfliers=False,whis=2)
    plt.xticks(np.arange(1,len(sorted_data)+1),sorted_data.keys(),rotation=75)
    plt.gca().set(xlabel='Neighbourhoods',ylabel='Price', title='Distribution of price for several areas in NYC');
dropBox = Dropdown(options=list(np.arange(0,101,10)),value=10,description='The least number of listings')
interact(function, x=dropBox);
embed_minimal_html('export.html', views=[dropBox], title='Widgets export')
We can see that the mean, interquartile range, and overall range generally increase from Washington Heights to the Fashion District. This makes sense because the center of NYC is Manhattan, the most populous of the boroughs. Therefore, with some geographical knowledge of NY, you can see that any neighborhood in this area, such as Midtown, Soho, or the Fashion District, usually has a higher rental price. I will construct maps of NYC later to make this clearer to interpret.
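The ranking behind the sorted boxplot boils down to a groupby-mean-sort; a minimal sketch on a toy frame (the neighbourhood names and prices below are illustrative stand-ins for `location_money`, not values from the dataset):

```python
import pandas as pd

# Toy stand-in for location_money: a few listings with neighbourhood and price.
toy = pd.DataFrame({
    'Neighbourhood Cleansed': ['Midtown', 'Midtown', 'Washington Heights', 'Soho'],
    'Price': [300, 260, 90, 350],
})

# Mean price per neighbourhood, sorted ascending -- the order used on the x-axis.
ranked = toy.groupby('Neighbourhood Cleansed')['Price'].mean().sort_values()
print(ranked)
```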
I also include an ipywidget that lets the user choose a number k to display the neighborhoods with at least k listings. Unfortunately, users cannot interact with this widget once the file is converted to HTML. If you want to interact with the ipywidget, I suggest downloading my source code as an ipynb file.
This section gives you some interesting geographical information about Airbnb listings in NYC, such as where different room types and price levels are located.
# Display the Kaggle dataset
ny_listing_kaggle.drop(columns=['last_review','reviews_per_month']).head()
# Read the shape file, which returns the GeoDataFrame
path = "../Airbnb-Datasets/nyc_borough/nyc_boroughs.shp"
borough = geopandas.read_file(path)
# Convert longitude and latitude to Point object
# Pass all attribures, including crs and geometry, to GeoDataFrame
crs = {'init':'epsg:4326'}
geometry = [Point(xy) for xy in zip(ny_listing_kaggle['longitude'],ny_listing_kaggle['latitude'])]
geo_df = geopandas.GeoDataFrame(ny_listing_kaggle, crs = crs, geometry = geometry).query('0 < price < 500')
# Create a subplot with 2 rows and 1 column
fig, ax = plt.subplots(2,1, figsize=(10,15), dpi=100)
# Plot the map of NYC on the first axes
borough.plot(ax=ax[0], alpha=0.4, color='grey')
# Plot the data points from GeoDataFrame with column "room_type" as value
# Pass in parameters "scheme" and "k" to create intervals for legend
geo_df.plot(ax=ax[0], alpha=0.6, markersize=5, column='room_type',
cmap='Blues', scheme='equal_interval', k=9,legend=True, legend_kwds={'loc':'upper left'});
# Set the labels and title
ax[0].set(xlabel='Longitude',ylabel='Latitude',title='Room-type locations in NYC')
# Plot the map of NYC on the second axes
borough.plot(ax=ax[1], alpha=0.4, color='grey')
# Plot the data points from GeoDataFrame with column "Price" as value
# Pass in parameters "scheme" and "k" to create intervals for legend
geo_df.plot(ax=ax[1], alpha=0.6, column='price', cmap='Blues',
markersize=5, scheme='equal_interval', k=9,legend=True, legend_kwds={'loc':'upper left'})
# Set the labels and title
ax[1].set(xlabel='Longitude',ylabel='Latitude',title='Price locations in NYC');
In these maps, we explore the locations of room types and prices in NYC using the Kaggle dataset. For the room-type map, private room, entire home/apt, and shared room are encoded as 0, 1, and 2, respectively. Since I do not know how to adjust the legend, I think it is better to describe it here for clarity. Private rooms appear most often in lower Manhattan; entire homes/apartments appear most often in upper Manhattan, Queens, and Brooklyn; shared rooms are distributed evenly across NYC.
For the price map, we only keep observations whose price is between 0 and 500 dollars, since some extreme outliers would affect our analysis. We can see that prices are higher in lower and mid Manhattan and decrease toward upper Manhattan, which makes sense because the lower areas are the hub of financial centers and headquarters, particularly Wall Street.
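A possible alternative to a fixed cutoff (not what this notebook does) is to trim by quantiles, so the threshold adapts to the data; a sketch on a toy frame standing in for the Kaggle dataset:

```python
import pandas as pd

# Toy price column -- not the Kaggle data; the 9999 plays the role of an outlier.
df = pd.DataFrame({'price': [30, 60, 90, 120, 150, 200, 450, 9999]})

# Keep only rows between the 1st and 99th percentiles.
lo, hi = df['price'].quantile([0.01, 0.99])
trimmed = df.query('@lo <= price <= @hi')   # same query style used above
print(len(trimmed), "of", len(df), "rows kept")
```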
Now that we know some interesting geographical facts about NYC Airbnb listings related to price and room type, we will explore some information about the properties themselves. In this plot, we will look at how many listings there are for each room type. Then we will see how the price is distributed given the room type and property type.
# Display the House DataFrame
house_df.head()
# Note: the Money DataFrame is displayed in Scatter plot section
# Can use a bar chart: in each chart (neighbourhood: Manhattan, Brooklyn,Queens)
# display the number of counts of every room type -> Each chart has 3 sub-columns (entire, private, shared)
plt.figure(figsize=(8,4), dpi=100)
room_counts = house_money['Room Type'].value_counts()
sb.countplot(x='Room Type',data=house_money)
plt.title("Number of rooms for each type in New York City")
for index,value in enumerate(room_counts.values):
    plt.annotate(value,(index,value+100),ha='center');
# Number of property type in each location (bar-chart)
plt.figure(figsize=(15,3), dpi=200);
property_type, room_type, price = house_money['Property Type'], house_money['Room Type'], house_money['Price']
table = pd.pivot_table(house_money,columns='Property Type',index='Room Type',values='Price',aggfunc=np.mean).fillna(0)
sb.heatmap(table, linewidths=0.5, cmap='YlGnBu')
plt.xticks(rotation=75), plt.yticks(rotation=45)
plt.title("The price distribution given the room type and the property type");
Based on the first graph, entire home/apt accounts for the largest number of rooms in NYC, followed by private rooms and then shared rooms. This may indicate that the entire home/apt is the most profitable of the three types, according to the law of supply and demand, where price and supply are positively correlated. In the next plot, entire homes/apartments tend to have higher prices than the other types. The plot also raises an interesting insight: serviced apartments and vacation homes are the most expensive property types. If a host plans to open an Airbnb of any type, a serviced apartment, vacation home, or condo may be a good option.
In this plot, we will see how Airbnb became popular in NYC, one of the primary goals of this tutorial, based on the number of listings and the number of reviews per month from 2009 to 2017.
# Display Host DataFrame
host_df.head()
# Note: the Review DataFrame is displayed in Scatter plot section
listing = host_df.groupby(by='Host Since').agg({'Host Listings Count':np.sum})
total_num = [np.sum(listing.iloc[:i].values) for i in range(len(listing))]
listing['Total listings'] = total_num
reviews = pd.DataFrame()
reviews['Time'], reviews['Reviews'] = pd.to_datetime(review_df['First Review']), review_df['Number of Reviews']
total_reviews = reviews.groupby('Time').agg({'Reviews':np.sum}).sort_index()
data = (listing.merge(total_reviews, how='inner', right_index=True, left_index=True)
.drop(columns=['Host Listings Count']).rename(columns={'index':'Time'}))
def function(year, include):
    fig, ax = plt.subplots(figsize=(15,5), dpi=200)
    ax2 = ax.twinx()
    if include:
        ax.plot(data.index,data['Total listings'], label='Number of listings', color='r')
        ax2.plot(data.index,data['Reviews'], alpha=0.5, label='Number of reviews',color='b')
    else:
        test = data[data.reset_index().apply(lambda x: x['index'].year == year, axis=1).values]
        ax.plot(test.index,test['Total listings'], label='Number of listings', color='r')
        ax2.plot(test.index,test['Reviews'], alpha=0.5, label='Number of reviews',color='b')
    plt.title("The popularity of Airbnb in NYC throughout the years")
    plt.xlabel("Years"), ax2.set_ylabel('Number of reviews'), ax.set_ylabel('Number of listings')
years = [2010,2011,2012,2013,2014,2015,2016,2017]
box = Dropdown(options=years, value=2012, description='Year')
check = Checkbox(value=True, description='Include all years')
interact(function, year=box, include=check);
Based on the number of listings, we see that it increases continuously through the years. We assume that people do not close their Airbnb listings, so the count is incremented every day, resulting in a non-decreasing number of listings. The number of reviews fluctuates and is less predictable, but it increases overall. The user can also interact with the graph to see the number of reviews and listings for each year.
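As a side note on the code above, the running total built with `np.sum(listing.iloc[:i].values)` can be expressed with `cumsum()`; note the original slice excludes the current row, so the two differ by one step. A sketch on toy data:

```python
import pandas as pd

# Toy stand-in for the grouped host data: new-listing counts per year.
listing = pd.DataFrame({'Host Listings Count': [2, 1, 3, 5]},
                       index=[2009, 2010, 2011, 2012])

# cumsum() includes the current row; .shift(1) on the result would
# reproduce the iloc[:i] variant, which sums only the rows before it.
listing['Total listings'] = listing['Host Listings Count'].cumsum()
print(listing['Total listings'].tolist())
```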
As before, I include an ipywidget that lets the user choose a year to display the number of reviews and listings in that year. Unfortunately, users cannot interact with this widget once the file is converted to HTML; if you want to interact with it, download my source code as an ipynb file.
Now that we have explored the dataset and understand which features are suitable for the training set, we will predict the price based on the chosen features listed below. Then, we will explore which value of k gives us the optimal model.
# Import necessary libraries
from sklearn.neighbors import KNeighborsRegressor
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
# Add a new column to the original dataset
ny_listings['Average Rooms'] = house_money['Average rooms']
# Define a list of features and label for training dataset
features = ['Neighbourhood Group Cleansed','Property Type','Room Type','Host Response Rate',
'Availability 365','Average Rooms','Calculated host listings count','Price']
# Load dataset
data_ml = ny_listings[features].dropna()
# Create a training set
X_train_dict = data_ml[features[:-1]].to_dict(orient='records')
y_train = data_ml['Price']
# Create DictVectorizer to encode the training set
vector = DictVectorizer(sparse=False)
vector.fit(X_train_dict)
X_train = vector.transform(X_train_dict)
# Create a new input set to predict the price
X_new_dict = {
'Neighbourhood Group Cleansed':'Manhattan','Property Type':'Apartment','Room Type':'Private room',
'Average Rooms':2.3,'Calculated host listings count':10,'Availability 365':365,'Host Response Rate':95
}
# Encode the new data point
X_new = vector.transform(X_new_dict)
# Create a Scaler to normalize the training and input set
scaler = StandardScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)
X_new_sc = scaler.transform(X_new)
# Create a K-nearest-neighbor model and fit the model
model = KNeighborsRegressor(n_neighbors=30)
model.fit(X_train_sc, y_train)
pred_y = model.predict(X_new_sc)
# Output the predicted price given an input
print("The predicted price is $%d." % pred_y[0]);
The machine learning model has 7 features in its training set: neighbourhood, property type, room type, average rooms, host listings count, annual availability, and host response rate. Thanks to the exploratory data analysis, we expect these features to correlate with the price, the label we want to predict. For instance, the predicted price for the input values above is $181.
# Compute errors with Cross validation set
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
def find_k_cv(k):
    # specify the pipeline
    vec = DictVectorizer(sparse=False)
    scaler = StandardScaler()
    model = KNeighborsRegressor(n_neighbors=k)
    pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])
    scores = cross_val_score(pipeline, X_train_dict, y_train, cv=7, scoring="neg_mean_squared_error")
    return -np.mean(scores)
k_series = pd.Series(np.arange(1,21), index=np.arange(1,21))
scores_cv = k_series.apply(lambda x: find_k_cv(x))
# Compute errors with training set
y_train_pr = []
def find_k_train(k):
    model = KNeighborsRegressor(n_neighbors=k)
    model.fit(X_train_sc, y_train)
    y_train_pred = model.predict(X_train_sc)
    y_train_pr.append(y_train_pred)
    return np.mean(np.abs(y_train - y_train_pred))
k_series = pd.Series(np.arange(1,21), index=np.arange(1,21))
score_train = k_series.apply(lambda x: find_k_train(x))
plt.figure(figsize=(5,3), dpi=100)
ax = plt.axes([0,0,1,1])
ax.plot(scores_cv/100, label='Cross validation',color='blue')
ax.plot(score_train, label='Training', color='red')
ax.set(title='The errors given k', xlabel='Value k', ylabel='Error')
plt.legend();
print("For training error, the optimal k is {}".format(score_train.idxmin()))
print("For cross validation error, the optimal k is {}".format(scores_cv.idxmin()))
Based on the plot above, we choose a k value of 3 to reach the optimal model. However, the cross-validation error is still extremely high (even though we scale it down by dividing by 100) compared to the training error. The cross-validation error decreases as k increases because a higher k is more likely to include labels close to the true label. On the other hand, the training error increases because a higher k is more likely to include outliers that affect the prediction. Furthermore, both lines flatten quickly as k increases. Since our model is subject to high variance, one way to reduce the error is to increase the size of the training set.
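An alternative to the hand-rolled loop above is scikit-learn's GridSearchCV, which scans k and cross-validates in one call; a self-contained sketch on synthetic data (the feature matrix below is a stand-in for the encoded Airbnb features):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data standing in for the encoded Airbnb features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([10.0, -5.0, 2.0]) + rng.normal(scale=0.5, size=200)

# Scan k = 1..20 with 5-fold cross-validation on mean squared error.
search = GridSearchCV(KNeighborsRegressor(),
                      param_grid={'n_neighbors': range(1, 21)},
                      cv=5, scoring='neg_mean_squared_error')
search.fit(X, y)
print("best k:", search.best_params_['n_neighbors'])
```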
# Display the Text DataFrame
text_df.head()
# Create a mask
mask = np.array(Image.open("img/color.jpg"))
# Set a list of stopwords
stopwords = set(STOPWORDS)
stopwords.update(['Subway','Central Park','Comfortable','Expensive'])
# Add all the texts together
host_text = "".join(review for review in text_df['Description'].values)
# Create a WordCloud object
word = WordCloud(max_font_size=40, max_words=len(host_text),background_color='white', stopwords=stopwords).generate(host_text)
# create coloring from image
image_colors = ImageColorGenerator(mask)
# Display the wordcloud and adjust the color
plt.figure(figsize=(14,14))
plt.axis('off')
plt.imshow(word.recolor(color_func=image_colors), interpolation='bilinear');
# Join all the text together
customer_text = "".join(review for review in text_df['Notes'].values)
# Create a WordCloud Object
word = WordCloud(max_font_size=40, max_words=len(customer_text),background_color='white').generate(customer_text)
# Display the wordcloud
plt.figure(figsize=(14,14))
plt.axis('off')
plt.imshow(word, interpolation='bilinear');
Looking at these colorful word clouds, we can see that hosts and customers share some of the same words, such as Central Park, New York City, living room, apartment, and Times Square. These are the most common words, according to their size in the word cloud. Thus, we can infer that hosts take customers' reviews seriously and try their best to improve their quality accordingly. Also, the most common words name the most popular destinations in NYC, which makes sense.
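The "most common words" reading can be cross-checked without a word cloud by counting tokens directly; a sketch on toy descriptions (stand-ins for `text_df['Description']`):

```python
from collections import Counter

# Toy descriptions standing in for text_df['Description'].
descriptions = [
    "Cozy apartment near Central Park",
    "Spacious apartment with living room",
    "Private room near Central Park subway",
]

# Lowercase, split on whitespace, drop a few stopwords, and count.
words = " ".join(descriptions).lower().split()
stop = {'near', 'with'}
counts = Counter(w for w in words if w not in stop)
print(counts.most_common(3))
```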
boston_listings['city'] = "Boston"; ny_listings['City'] = "New York City"; san_listings['city'] = "San Francisco"
boston_price = boston_listings[["city",'price']].dropna()
boston_price['price'] = boston_price.apply(lambda x: float(x['price'].replace("$","").replace(",","")), axis=1)
boston_price = boston_price[(0 < boston_price['price']) & (boston_price['price'] < 400)]
nyc_price = ny_listings[['City','Price']].rename(columns={"City":"city","Price":'price'}).dropna()
nyc_price = nyc_price[(nyc_price['price'] > 0) & (nyc_price['price'] < 400)]
san_price = san_listings[['city','price']].dropna()
san_price['price'] = san_price.apply(lambda x: float(x['price'].replace("$","").replace(",","")), axis=1)
san_price = san_price[(0 < san_price['price']) & (san_price['price'] < 400)]
plt.figure(figsize=(10,5), dpi=100)
data = pd.concat([boston_price,nyc_price,san_price])
sb.boxplot(data=data, x='city',y='price')
plt.title("Price among 3 cities");
To reduce the effect of outliers, we only analyze data whose price is between 0 and 400 dollars. The prices in the three cities appear roughly similar to each other. For each city, the median is approximately 150 dollars, and the interquartile box spans roughly 90 to 200 dollars. These prices are considered high, which makes sense since Boston, New York, and San Francisco are among the most expensive cities in the world.
x = np.arange(3)
width = 0.2
order = ['Entire home/apt', 'Private room', 'Shared room']
# Reindex so the bars line up with the shared x-tick labels
# regardless of each city's value_counts() ordering
san = (san_listings['room_type'].value_counts() / len(san_listings)).reindex(order)
boston = (boston_listings['room_type'].value_counts() / len(boston_listings)).reindex(order)
nyc = (ny_listings['Room Type'].value_counts() / len(ny_listings)).reindex(order)
plt.figure(figsize=(10,5), dpi=100)
plt.bar(x,san.values, width=width, label='San Francisco')
plt.bar(x+width, boston.values, width=width, label='Boston')
plt.bar(x+2*width, nyc.values, width=width, label='New York')
plt.gca().set_xticks(x+width)
plt.gca().set_xticklabels(['Entire home/apt','Private room','Shared room'])
plt.title("Room type among three cities")
plt.legend();
Since each city has a different number of listings, we compute the percentage of each room type per city. Once again, the percentages are roughly similar across the three cities. Among the three types, entire home/apt has the highest share, while shared room has the lowest. As discussed above, it is more profitable for Airbnb owners to rent out an entire home/apt than the other types.
Through this exploratory data analysis and visualization project, we gained several interesting insights into the Airbnb rental market. Below we will summarise the answers to the questions that we wished to answer at the beginning of the project:
As mentioned above, there is a positive correlation between price and host response rate, as well as between price and the average number of rooms. Therefore, a possible suggestion for hosts is to increase their response rate and renovate their houses, for example by adding side beds or kitchen utilities. On the other hand, rating scores and number of reviews may have no relationship with the rental price. The two former features are included in the training set for the machine learning model.
The rental price distribution can be seen in the boxplot comparing the NYC dataset with the others, as well as in the first plot of this project. The mean price is 173 dollars, similar to other big cities such as San Francisco or Boston. The distribution is skewed to the right, which indicates that the mean is higher than the median.
Manhattan has the most expensive rentals compared to the other boroughs. Prices are higher for rentals closer to city hotspots. However, there are a few outliers in Bronx, Staten Island and Brooklyn that defy the above hypothesis.
The most popular room type is entire home/apt.
The demand (assuming that it can be inferred from the number of reviews and number of listings) shows a seasonal pattern - demand increases from January to October, then drops slightly in November and December. In general, the demand for Airbnb listings has been steadily increasing over the years.
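The seasonal reading above could be checked by bucketing review dates by calendar month; a sketch with toy timestamps standing in for `review_df['First Review']`:

```python
import pandas as pd

# Toy timestamps standing in for review_df['First Review'].
dates = pd.to_datetime(['2016-01-05', '2016-06-12', '2016-06-20',
                        '2016-10-01', '2016-11-15'])

# Count reviews per calendar month to expose any seasonal pattern.
by_month = pd.Series(1, index=dates).groupby(dates.month).sum()
print(by_month)
```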
A kNN model is used to fit the training data. After the data visualization, I came up with seven features related to the rental price. However, the training and test errors are high for this model, which suggests that either the prices of Airbnb apartments are rather arbitrary or there exist other, more influential factors not included in this dataset. One solution to reduce the errors is to collect more training data.
As we can see in the two word clouds, customers care about the living room, the kitchen, transportation, and the location of a house. A possible suggestion for hosts is to pay attention to customers' reviews and make reasonable changes to satisfy them.
In addition to the prices discussed above, NYC's room-type percentages are close to those of Boston and San Francisco. This makes sense since Airbnb owners gain the most profit from renting out entire homes/apartments.
The first obstacle in dealing with this dataset was that it has a lot of missing data. I had to spend quite a lot of time finding the most appropriate imputation technique, and the solution I used was mean imputation. The second difficulty was finding a suitable mapping library. Among Basemap, Folium, and GeoPandas, I chose GeoPandas because it is the most convenient and efficient: I can map the data directly from the DataFrame without learning much more complicated syntax. Finally, my remaining problem is finding a way to reduce the error of the machine learning model.
Besides gaining interesting insights into the Airbnb rental market in NYC, I acquired several technical and soft skills along the way. Dealing with multiple data formats helped me strengthen my skills in data manipulation and cleaning. I learned how to work with different Python libraries, particularly scikit-learn, to train machine learning models. I also learned how to use GitHub and other version control tools effectively while working on a project.
I want to expand my analysis to more cities and compare patterns and trends among them. From the insights I have derived, I would also like to build predictive models using different features from the dataset. Lastly, I hope to apply the visualizations and techniques used in this project to many other fields and datasets.